Good Data Is All Imitation Learning Needs
Amir Samadi, Konstantinos Koufos, Kurt Debattista, Mehrdad Dianati
In this paper, we address the limitations of traditional teacher-student models, imitation learning, and behaviour cloning in the context of Autonomous/Automated Driving Systems (ADS), where these methods often struggle with incomplete coverage of real-world scenarios. To enhance the robustness of such models, we introduce Counterfactual Explanations (CFEs) as a novel data augmentation technique for end-to-end ADS. By generating training samples near decision boundaries through minimal input modifications, CFEs lead to a more comprehensive representation of expert driver strategies, particularly in safety-critical scenarios. This approach therefore helps improve the model's ability to handle rare and challenging driving events, such as anticipating pedestrians darting out, ultimately leading to safer and more trustworthy decision-making for ADS. Our experiments in the CARLA simulator demonstrate that the resulting model, CF-Driver, outperforms the current state-of-the-art method, achieving a higher driving score and lower infraction rates. Specifically, CF-Driver attains a driving score of 84.2, surpassing the previous best model by 15.02 percentage points. These results highlight the effectiveness of incorporating CFEs in training end-to-end ADS. To foster further research, the CF-Driver code is made publicly available.
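The core mechanism (generating a minimally modified input that just crosses the model's decision boundary) can be sketched for a toy linear policy. Everything below, including the weights and the brake/no-brake framing, is an illustrative assumption, not CF-Driver's actual model:

```python
import numpy as np

# Toy linear "brake / don't brake" policy: brake when w @ x + b > 0.
# w, b, and the feature layout are illustrative stand-ins.
w = np.array([2.0, -1.0])
b = -0.5

def counterfactual(x, step=0.05, max_iter=200):
    """Walk x in small steps along the decision gradient until the
    policy's decision flips, yielding a near-boundary training sample."""
    original = np.sign(w @ x + b)
    x_cf = x.astype(float).copy()
    direction = -original * w / np.linalg.norm(w)  # toward the other class
    for _ in range(max_iter):
        if np.sign(w @ x_cf + b) != original:
            break  # decision flipped: x_cf is a counterfactual
        x_cf += step * direction
    return x_cf
```

Samples produced this way sit just across the boundary while staying close to the original input, which is what makes them useful as augmentation for safety-critical corner cases.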
Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism
Chanjun Park, Minsoo Khang, Dahyun Kim
This paper delves into the contrasting roles of data within academic and industrial spheres, highlighting the divergence between Data-Centric AI and Model-Agnostic AI approaches. We argue that while Data-Centric AI focuses on the primacy of high-quality data for model performance, Model-Agnostic AI prioritizes algorithmic flexibility, often at the expense of data quality considerations. This distinction reveals that academic standards for data quality frequently do not meet the rigorous demands of industrial applications, leading to potential pitfalls in deploying academic models in real-world settings. Through a comprehensive analysis, we address these disparities, presenting both the challenges they pose and strategies for bridging the gap. Furthermore, we propose a novel paradigm: Model-Based Data-Centric AI, which aims to reconcile these differences by integrating model considerations into data optimization processes. This approach underscores the necessity for evolving data requirements that are sensitive to the nuances of both academic research and industrial deployment. By exploring these discrepancies, we aim to foster a more nuanced understanding of data's role in AI development and encourage a convergence of academic and industrial standards to enhance AI's real-world applicability.
AI Is Running Circles Around Robotics
When people imagine the AI apocalypse, they generally imagine robots. But the robot-takeover scenario most often envisioned by science fiction is not exactly looming. Recent and explosive progress in AI--along with recent and explosive hype surrounding it--has made the existential risks posed by the technology a topic of mainstream conversation. Yet progress in robotics--which is to say, machines capable of interacting with the physical world through motion and perception--has been lagging way behind. "I can't help but feel a little envious," said Eric Jang, the vice president of AI at the humanoid-robotics company 1X, in a talk at a robotics conference last year.
Cleanlab: Correct your data labels automatically and quickly – Towards AI
Originally published on Towards AI. I used an open-source library, cleanlab, to remove low-quality labels from an image dataset. The model trained on the dataset without the low-quality data gained 4 percentage points of accuracy over the baseline model (trained on all data). Improving data quality sounds easy enough, but the workload of manually checking data quality can quickly become insurmountable as a dataset scales.
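The confident-learning idea behind cleanlab can be sketched in plain NumPy: flag samples whose given label receives low out-of-sample predicted probability. This is a simplified stand-in with a single hypothetical global threshold, not cleanlab's actual per-class thresholds or API:

```python
import numpy as np

def flag_label_issues(labels, pred_probs, threshold=0.5):
    """Flag samples whose given label receives low predicted probability.

    labels:     (n,) integer class labels as annotated
    pred_probs: (n, k) out-of-sample predicted probabilities
                (e.g. from cross-validation, so the model never
                scores a sample it was trained on)
    Returns indices of likely label issues.
    """
    self_conf = pred_probs[np.arange(len(labels)), labels]
    return np.where(self_conf < threshold)[0]
```

Dropping (or re-reviewing) the flagged indices before retraining is the workflow the article describes: the model effectively audits its own training labels.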
Good Data from Bad Models: Foundations of Threshold-based Auto-labeling
Harit Vishwakarma, Heguang Lin, Frederic Sala, Ramya Korlakai Vinayak
Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, where validation data obtained from humans is used to find a threshold for confidence above which the data is machine-labeled, is emerging as a popular solution used widely in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.
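The threshold-based scheme the paper analyzes can be sketched as follows. Function names and the target-accuracy parameter are illustrative, not the paper's notation:

```python
import numpy as np

def fit_threshold(val_conf, val_correct, target_acc=0.95):
    """Use human-labeled validation data to pick a confidence threshold.

    val_conf:    (n,) model confidence on validation points
    val_correct: (n,) 1 if the model's prediction matched the human label
    Returns the smallest threshold whose accepted region still meets
    target_acc, or None if no threshold is safe.
    """
    order = np.argsort(-val_conf)                      # most confident first
    conf, correct = val_conf[order], val_correct[order]
    cum_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    ok = np.where(cum_acc >= target_acc)[0]
    if len(ok) == 0:
        return None
    return conf[ok.max()]  # confidence of the last point in the largest safe prefix

def auto_label(unl_conf, unl_pred, threshold):
    """Machine-label only the unlabeled points at or above the threshold."""
    mask = unl_conf >= threshold
    return mask, unl_pred[mask]
```

The paper's sample-complexity bounds concern exactly how much human-labeled validation data `fit_threshold` needs before the accepted region's accuracy guarantee can be trusted, which is where the "hidden downside" of validation-data usage appears.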
Efficient Medical Image Assessment via Self-supervised Learning
Chun-Yin Huang, Qi Lei, Xiaoxiao Li
High-performance deep learning methods typically rely on large annotated training datasets, which are difficult to obtain in many clinical applications due to the high cost of medical image labeling. Existing data assessment methods commonly require knowing the labels in advance, which is not feasible for our goal of 'knowing which data to label.' To this end, we formulate and propose a novel and efficient data assessment strategy, the EXponentiAl Marginal sINgular valuE (EXAMINE) score, to rank the quality of unlabeled medical image data based on useful latent representations extracted via Self-supervised Learning (SSL) networks. Motivated by the theoretical implications of the SSL embedding space, we leverage a Masked Autoencoder for feature extraction. We then evaluate data quality by the marginal change of the largest singular value after excluding a data point from the dataset. We conduct extensive experiments on a pathology dataset. Our results indicate the effectiveness and efficiency of the proposed method for selecting the most valuable data to label.
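The marginal-singular-value criterion can be sketched directly: score each point by how much the largest singular value of the embedding matrix drops when that point is removed. This brute-force version runs one SVD per point; the paper derives a more efficient estimate, and the function name is illustrative:

```python
import numpy as np

def marginal_sv_scores(embeddings):
    """Score each row of the (n, d) SSL embedding matrix by the drop in
    the largest singular value when that row is excluded.

    Higher score = the point contributes more to the dominant direction
    of the embedding space, i.e. it is more 'valuable' to label under
    this criterion.
    """
    sigma_full = np.linalg.svd(embeddings, compute_uv=False)[0]
    scores = np.empty(len(embeddings))
    for i in range(len(embeddings)):
        reduced = np.delete(embeddings, i, axis=0)
        scores[i] = sigma_full - np.linalg.svd(reduced, compute_uv=False)[0]
    return scores
```

On a toy matrix where one row dominates the spectrum, that row receives the largest score, matching the intuition that removing an influential point changes the top singular value most.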
Pinaki Laskar on LinkedIn: #MLOps #AI #machinelearning
Why is #MLOps the key to a productionized ML system? ML model code is only a small part (5–10%) of a successful ML system, and the objective should be to create value by placing ML models into production. Data scientists tend to optimize for model metrics (e.g. F1 score) while stakeholders focus on business metrics. Improving labelling consistency is an iterative process, so consider repeating the process until disagreements are resolved as far as possible. For instance, partial automation with a human in the loop can be an ideal design for AI-based interpretation of medical scans, with human judgement coming in for cases where prediction confidence is low.
Good data is a key component to AI innovation and machine learning
When the Biden Administration launched an AI task force earlier this month to create a path to "democratize access to research tools to promote AI," the goal of access was paramount. "The task force consists of some of the top experts in academia and industry," said Dinesh Manocha, a professor of computer science and electrical and computer engineering at the University of Maryland, on Federal Monthly Insights – Repurposing Manpower through Automation. "They recognize the importance and they're pushing for more development in the field by making good data available. So data is a very key component of AI and machine learning-based methods."

Manocha said AI is as old as the field, pointing to "Founding Father" Alan Turing, who he said laid the foundations in the 1950s. "Machine learning is one sub-area in the broader field of AI," said Manocha on Federal Drive with Tom Temin. "All the recent developments in AI, all the penetration in the real world, has primarily been driven by the excitement in the last five to 10 years from machine learning."

The breadth of AI and machine learning is quite evident simply from looking at the classes offered at the University of Maryland and what students want to study. "A lot of computer science majors want to take AI," Manocha said. "Machine learning, by itself, has become such an important subtopic that we even offer multiple classes in it at the undergraduate and graduate levels."

Focusing on data and algorithms, AI and machine learning imitate the way humans learn. "One of the grand challenges in AI is how we can emulate human-like intelligence, which is still a big open problem," Manocha said. "There have been a lot of approaches pursued and proposed by wonderful researchers over the last 50 to 60 years." Manocha pointed to great advances in AI and machine learning in the sub-branch of "deep learning," which imitates human knowledge and thinking. But there is still a ways to go.
"If you have some data, you can easily get 50%-to-70% accuracy," Manocha said. "To go from 50-to-70% to 90%, you get 100x more data."
Smart Water: Data Labeling with Active Learning And H2O.ai
Data is the food for AI. For machine learning, or supervised learning, golden labels are key for models to recognize the patterns within the data. However, in real-world data it is usually hard to get large amounts of labeled data, for example in search relevance, news topics, and autopilot. Recently, Andrew Ng gave a talk on MLOps: From Model-centric to Data-centric AI, where he discussed the idea of moving from Big Data to Good Data. Good data is defined consistently and covers the important cases.
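One common active-learning strategy for choosing which data to label next is uncertainty sampling, sketched below in plain NumPy. The function name and batch size are illustrative; the article's actual H2O.ai pipeline may differ:

```python
import numpy as np

def uncertainty_sample(pred_probs, batch_size=10):
    """Pick the unlabeled points the current model is least sure about.

    pred_probs: (n, k) class probabilities over the unlabeled pool.
    Returns indices of the batch_size lowest-confidence points, which
    go to human annotators next; the model is then retrained and the
    loop repeats.
    """
    confidence = pred_probs.max(axis=1)       # top-class probability per point
    return np.argsort(confidence)[:batch_size]
```

Labeling near-boundary, low-confidence points first is what lets active learning reach a target accuracy with far fewer golden labels than labeling the pool at random.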